Introduction

I always found it difficult to find perfectly organized, clean data on websites. Because of this, I decided to learn how to scrape webpages for the specific things that I needed. Whatever table, paragraph, or hidden image I wanted, I could grab it and import it into RStudio. From there, I can clean and rearrange the data to suit my needs, producing a polished image or data frame.

The Basics

The first step in web scraping is reading in a webpage. This is done simply with read_html("webpage"). Once a webpage is read in, you can begin gathering data. The easiest way to import data is to search for tables within the page using html_table(), which returns a list of every table on the webpage that you can then search through. By selecting one of them, for example table <- tables[[1]], you can import an entire data set into RStudio and begin your data manipulation.
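As a minimal sketch of these steps, the example below uses a small inline page built with rvest's minimal_html() so that it runs without a network connection. The inline table is a stand-in I made up for illustration; for a live site you would pass its URL to read_html() instead.

```r
library(rvest)

# A small inline page stands in for a real website so the example runs offline.
# For a live page you would use read_html("<url of the page>") instead.
page <- minimal_html('
  <table>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>Albania</td><td>2877797</td></tr>
    <tr><td>Andorra</td><td>77265</td></tr>
  </table>
')

# html_table() on a whole page returns a list with one data frame per <table>
tables <- html_table(page)

# Pick the table you want by its position in the list
table <- tables[[1]]
print(table)
```

On a real page the list usually holds several tables, so it is worth printing length(tables) and looking at each one before choosing an index.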

If you are searching through paragraphs or images for multiple specific items, you must target an HTML element. With a browser plug-in called SelectorGadget, you can highlight any part of a webpage and grab the CSS selector or XPath for that element. When scraping an HTML element, you read it in through this "path", which tells R where in the HTML document to look for the specific data that you want.
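Here is a short sketch of scraping by path, once with a CSS selector (the kind SelectorGadget gives you) and once with XPath. The inline page, the class names, and the image file name are all made up for illustration.

```r
library(rvest)

# Hypothetical page content; class names and file names are invented
page <- minimal_html('
  <p class="intro">Population data by country.</p>
  <p class="note">Figures are for 2020.</p>
  <img src="map.png" alt="World map">
')

# CSS selector: every <p> element with class "note"
notes <- page %>%
  html_elements("p.note") %>%
  html_text2()

# XPath: every <img> element, then read its src attribute
image_src <- page %>%
  html_elements(xpath = "//img") %>%
  html_attr("src")

print(notes)
print(image_src)
```

html_text2() returns the visible text of each matched element, while html_attr() pulls out a single attribute such as an image source or a link target.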





If you are curious where these paths come from, you can inspect an element in your browser. This brings up the page's underlying HTML code.


The First Webpage

A typical website will have either a built-in "table", a paragraph, or some type of image explaining a certain topic. In that form, the data is difficult to manipulate or analyze. My first website listed demographics for each country in the world. It looked a little like this: https://www.worldometers.info/geography/alphabetical-list-of-countries/

Combining the techniques listed above, I was able to import a list of countries with their populations, land areas, and densities as a data.frame into RStudio.

The data that I collected from this webpage was in rough shape and required some cleaning and rearranging. Here is some of the code used:
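The original cleaning code is not reproduced on this page, so the block below is only a sketch of what such a step might look like with janitor and dplyr. The raw column names and the comma-formatted numbers are assumptions about how the scraped table arrives, not a copy of the real output.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in for the scraped table: awkward column names and
# numbers stored as text with thousands separators (both are assumptions)
raw <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria"),
  `Population (2020)` = c("38,928,346", "2,877,797", "43,851,044"),
  check.names = FALSE
)

countries <- raw %>%
  clean_names() %>%                                 # "Population (2020)" becomes population_2020
  mutate(population_2020 = as.numeric(gsub(",", "", population_2020)))

print(countries)
```

clean_names() converts the headers to snake_case so they are easy to type, and stripping the commas lets R treat the populations as numbers instead of text.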

The end product of my first web scraping is a scrollable table listing each country and its 2020 population, which can be saved as an HTML document and embedded in any website or presentation:

Country Population (2020)
Afghanistan 38928346
Albania 2877797
Algeria 43851044
Andorra 77265
Angola 32866272
Antigua and Barbuda 97929
Argentina 45195774
Armenia 2963243
Australia 25499884
Austria 9006398
Azerbaijan 10139177
Bahamas 393244
Bahrain 1701575
Bangladesh 164689383
Barbados 287375
Belarus 9449323
Belgium 11589623
Belize 397628
Benin 12123200
Bhutan 771608
Bolivia 11673021
Bosnia and Herzegovina 3280819
Botswana 2351627
Brazil 212559417
Brunei 437479
Bulgaria 6948445
Burkina Faso 20903273
Burundi 11890784
Côte d'Ivoire 26378274
Cape Verde 555987
Cambodia 16718965
Cameroon 26545863
Canada 37742154
Central African Republic 4829767
Chad 16425864
Chile 19116201
China 1439323776
Colombia 50882891
Comoros 869601
Congo [Republic] 5518087
Costa Rica 5094118
Croatia 4105267
Cuba 11326616
Cyprus 1207359
Czech Republic 10708981
Congo [DRC] 89561403
Denmark 5792202
Djibouti 988000
Dominica 71986
Dominican Republic 10847910
Ecuador 17643054
Egypt 102334404
El Salvador 6486205
Equatorial Guinea 1402985
Eritrea 3546421
Estonia 1326535
Swaziland 1160164
Ethiopia 114963588
Fiji 896445
Finland 5540720
France 65273511
Gabon 2225734
Gambia 2416668
Georgia 3989167
Germany 83783942
Ghana 31072940
Greece 10423054
Grenada 112523
Guatemala 17915568
Guinea 13132795
Guinea-Bissau 1968001
Guyana 786552
Haiti 11402528
Vatican City 801
Honduras 9904607
Hungary 9660351
Iceland 341243
India 1380004385
Indonesia 273523615
Iran 83992949
Iraq 40222493
Ireland 4937786
Israel 8655535
Italy 60461826
Jamaica 2961167
Japan 126476461
Jordan 10203134
Kazakhstan 18776707
Kenya 53771296
Kiribati 119449
Kuwait 4270571
Kyrgyzstan 6524195
Laos 7275560
Latvia 1886198
Lebanon 6825445
Lesotho 2142249
Liberia 5057681
Libya 6871292
Liechtenstein 38128
Lithuania 2722289
Luxembourg 625978
Madagascar 27691018
Malawi 19129952
Malaysia 32365999
Maldives 540544
Mali 20250833
Malta 441543
Marshall Islands 59190
Mauritania 4649658
Mauritius 1271768
Mexico 128932753
Micronesia 548914
Moldova 4033963
Monaco 39242
Mongolia 3278290
Montenegro 628066
Morocco 36910560
Mozambique 31255435
Myanmar [Burma] 54409800
Namibia 2540905
Nauru 10824
Nepal 29136808
Netherlands 17134872
New Zealand 4822233
Nicaragua 6624554
Niger 24206644
Nigeria 206139589
North Korea 25778816
Macedonia [FYROM] 2083374
Norway 5421241
Oman 5106626
Pakistan 220892340
Palau 18094
Palestinian Territories 5101414
Panama 4314767
Papua New Guinea 8947024
Paraguay 7132538
Peru 32971854
Philippines 109581078
Poland 37846611
Portugal 10196709
Qatar 2881053
Romania 19237691
Russia 145934462
Rwanda 12952218
Saint Kitts and Nevis 53199
Saint Lucia 183627
Saint Vincent and the Grenadines 110940
Samoa 198414
San Marino 33931
São Tomé and Príncipe 219159
Saudi Arabia 34813871
Senegal 16743927
Serbia 8737371
Seychelles 98347
Sierra Leone 7976983
Singapore 5850342
Slovakia 5459642
Slovenia 2078938
Solomon Islands 686884
Somalia 15893222
South Africa 59308690
South Korea 51269185
South Sudan 11193725
Spain 46754778
Sri Lanka 21413249
Sudan 43849260
Suriname 586632
Sweden 10099265
Switzerland 8654622
Syria 17500658
Tajikistan 9537645
Tanzania 59734218
Thailand 69799978
Timor-Leste 1318445
Togo 8278724
Tonga 105695
Trinidad and Tobago 1399488
Tunisia 11818619
Turkey 84339067
Turkmenistan 6031200
Tuvalu 11792
Uganda 45741007
Ukraine 43733762
United Arab Emirates 9890402
United Kingdom 67886011
United States 331002651
Uruguay 3473730
Uzbekistan 33469203
Vanuatu 307145
Venezuela 28435940
Vietnam 97338579
Yemen 29825964
Zambia 18383955
Zimbabwe 14862924
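A scroll table like the one above can be sketched with kableExtra as follows. The three-row data frame and the output file name are placeholders, and writing the HTML out with writeLines() is a choice I made to keep the example dependency-free; kableExtra also provides save_kable() for this purpose.

```r
library(kableExtra)

# Placeholder data: the first three rows of the table above
countries <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  population_2020 = c(38928346, 2877797, 43851044)
)

scroll_table <- countries %>%
  kbl(format = "html", col.names = c("Country", "Population (2020)")) %>%
  kable_styling(bootstrap_options = "striped") %>%
  scroll_box(height = "300px")        # fixed-height box with a scroll bar

# Write the table's HTML to a file (hypothetical file name)
out_file <- file.path(tempdir(), "countries_table.html")
writeLines(as.character(scroll_table), out_file)
```

The saved file contains only the table markup, so it can be pasted or included into another page that supplies its own styling.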

The Second Webpage

I wanted a more interactive display of the previous list of countries, so I chose to create a map. To complete this task, I needed the latitude and longitude of every country. Scraping the table below, I imported this information into an R data frame.

https://developers.google.com/public-data/docs/canonical/countries_csv

Through R code for cleaning, combining, and creating, I was able to produce an interactive map of the world. This map lets you view every country and, when a country is clicked, displays its 2020 population.

htmltools::includeHTML("Second_Webpage/map.html")
[Interactive leaflet map]
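A map of this kind can be sketched with leaflet roughly as follows. The two small data frames stand in for the scraped population and coordinate tables, their column names are my own, and the coordinates are approximate values included only for illustration.

```r
library(leaflet)
library(htmlwidgets)

# Hypothetical stand-ins for the two scraped tables
populations <- data.frame(
  country = c("Albania", "Andorra"),
  population_2020 = c(2877797, 77265)
)
coords <- data.frame(
  country = c("Albania", "Andorra"),
  latitude = c(41.15, 42.55),     # approximate, for illustration only
  longitude = c(20.17, 1.60)
)

# Combine the two tables on the country name
map_data <- merge(populations, coords, by = "country")

# One marker per country; clicking it shows the 2020 population
world_map <- leaflet(map_data) %>%
  addTiles() %>%
  addMarkers(
    lng = ~longitude, lat = ~latitude,
    popup = ~paste0(country, ": ",
                    format(population_2020, big.mark = ",", trim = TRUE))
  )

# Save the widget as HTML (hypothetical file name)
map_file <- file.path(tempdir(), "map.html")
saveWidget(world_map, map_file, selfcontained = FALSE)
```

Because the join is on country names, any spelling differences between the two sources (for example bracketed names such as "Myanmar [Burma]") need to be reconciled before merging, or those countries will silently drop off the map.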

The Third Webpage

For my final push in learning web scraping, I wanted to be a little more creative. Using a scraped list of the most visited countries together with their top tourist attractions, I was able to create an interactive plot.

The website with this information was a little trickier. Using the browser's HTML inspector, I was able to pinpoint the most visited countries within the website's map and scrape the data into RStudio. https://worldpopulationreview.com/country-rankings/most-visited-countries

This was then combined through R code with a second CSV file of top visited tourist attractions to create the plot below. The graph shows the number of tourist arrivals and, when a bar is hovered over, the top tourist attraction in that country.

[Interactive plotly bar chart]
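The combination of the two data sources and the hover behavior can be sketched with plotly as follows. The countries, arrival figures, and attractions are placeholder values I invented to make the example run; they are not the scraped data.

```r
library(plotly)

# Placeholder data, not the scraped figures
arrivals <- data.frame(
  country = c("Country A", "Country B"),
  arrivals_millions = c(90, 80)
)
attractions <- data.frame(
  country = c("Country A", "Country B"),
  attraction = c("Attraction A", "Attraction B")
)

# Combine the scraped arrivals with the attractions from the CSV file
plot_data <- merge(arrivals, attractions, by = "country")

# Bar height shows arrivals; the hover label adds the top attraction
p <- plot_ly(
  plot_data,
  x = ~country,
  y = ~arrivals_millions,
  type = "bar",
  hovertext = ~paste0(country,
                      "<br>Arrivals: ", arrivals_millions, " million",
                      "<br>Top attraction: ", attraction),
  hoverinfo = "text"
)
p <- layout(p, yaxis = list(title = "Tourist arrivals (millions)"))
```

Setting hoverinfo = "text" tells plotly to show only the custom label on hover instead of its default x and y readout.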

Conclusion

Not only was I able to master the art of web scraping, but I also learned some valuable packages, such as xml2, rvest, janitor, kableExtra, htmlwidgets, and plotly.